{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word-level language modeling using PyTorch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "1. [Background](#Background)\n",
    "1. [Setup](#Setup)\n",
    "1. [Data](#Data)\n",
    "1. [Train](#Train)\n",
    "1. [Host](#Host)\n",
    "\n",
    "---\n",
    "\n",
    "## Background\n",
    "\n",
    "This example trains a multi-layer LSTM RNN model on a language modeling task based on [PyTorch example](https://github.com/pytorch/examples/tree/master/word_language_model). By default, the training script uses the Wikitext-2 dataset. We will train a model on SageMaker, deploy it, and then use deployed model to generate new text.\n",
    "\n",
    "For more information about the PyTorch in SageMaker, please visit [sagemaker-pytorch-containers](https://github.com/aws/sagemaker-pytorch-containers) and [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) github repositories.\n",
    "\n",
    "---\n",
    "\n",
    "## Setup\n",
    "\n",
    "_This notebook was created and tested on an ml.p2.xlarge notebook instance._\n",
    "\n",
    "Let's start by creating a SageMaker session and specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data.  This should be within the same region as the Notebook Instance, training, and hosting.\n",
    "- The IAM role arn used to give training and hosting access to your data. See [the documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these.  Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the sagemaker.get_execution_role() with appropriate full IAM role arn string(s).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "\n",
    "bucket = sagemaker_session.default_bucket()\n",
    "prefix = 'sagemaker/DEMO-pytorch-rnn-lstm'\n",
    "\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data\n",
    "### Getting the data\n",
    "As mentioned above we are going to use [the wikitext-2 raw data](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/). This data is from Wikipedia and is licensed CC-BY-SA-3.0. Before you use this data for any other purpose than this example, you should understand the data license, described at https://creativecommons.org/licenses/by-sa/3.0/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "wget http://research.metamind.io.s3.amazonaws.com/wikitext/wikitext-2-raw-v1.zip\n",
    "unzip -n wikitext-2-raw-v1.zip\n",
    "cd wikitext-2-raw\n",
    "mv wiki.test.raw test && mv wiki.train.raw train && mv wiki.valid.raw valid\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's preview what data looks like."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head -5 wikitext-2-raw/train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Uploading the data to S3\n",
    "We are going to use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value inputs identifies the location -- we will use later when we start the training job.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputs = sagemaker_session.upload_data(path='wikitext-2-raw', bucket=bucket, key_prefix=prefix)\n",
    "print('input spec (in this case, just an S3 path): {}'.format(inputs))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train\n",
    "### Training script\n",
    "We need to provide a training script that can run on the SageMaker platform. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:\n",
    "\n",
    "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.\n",
    "  These artifacts are uploaded to S3 for model hosting.\n",
    "* `SM_OUTPUT_DATA_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may\n",
    "  include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed\n",
    "  and uploaded to S3 to the same S3 prefix as the model artifacts.\n",
    "\n",
    "Supposing one input channel, 'training', was used in the call to the PyTorch estimator's `fit()` method,\n",
    "the following will be set, following the format `SM_CHANNEL_[channel_name]`:\n",
    "\n",
    "* `SM_CHANNEL_TRAINING`: A string representing the path to the directory containing data in the 'training' channel.\n",
    "\n",
    "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script run by this notebook:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize 'source/train.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For more information about training environment variables, please visit [SageMaker Containers](https://github.com/aws/sagemaker-containers)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the current example we also need to provide source directory since training script imports data and model classes from other modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!ls source"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run training in SageMaker\n",
    "The PyTorch class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script and source directory, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on ```ml.p2.xlarge``` instance. As you can see in this example you can also specify hyperparameters. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.pytorch import PyTorch\n",
    "\n",
    "estimator = PyTorch(entry_point='train.py',\n",
    "                    role=role,\n",
    "                    framework_version='1.1.0',\n",
    "                    train_instance_count=1,\n",
    "                    train_instance_type='ml.p2.xlarge',\n",
    "                    source_dir='source',\n",
    "                    # available hyperparameters: emsize, nhid, nlayers, lr, clip, epochs, batch_size,\n",
    "                    #                            bptt, dropout, tied, seed, log_interval\n",
    "                    hyperparameters={\n",
    "                        'epochs': 6,\n",
    "                        'tied': True\n",
    "                    })\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After we've constructed our PyTorch object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "estimator.fit({'training': inputs})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Host\n",
    "### Hosting script\n",
    "We are going to provide custom implementation of `model_fn`, `input_fn`, `output_fn` and `predict_fn` hosting functions in a separate file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize 'source/generate.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also put your training and hosting code in the same file but you would need to add a main guard (`if __name__=='__main__':`) for the training code, so that the container does not inadvertently run it at the wrong point in execution during hosting."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Import model into SageMaker\n",
    "The PyTorch model uses a npy serializer and deserializer by default. For this example, since we have a custom implementation of all the hosting functions and plan on using JSON instead, we need a predictor that can serialize and deserialize JSON."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.predictor import RealTimePredictor, json_serializer, json_deserializer\n",
    "\n",
    "class JSONPredictor(RealTimePredictor):\n",
    "    def __init__(self, endpoint_name, sagemaker_session):\n",
    "        super(JSONPredictor, self).__init__(endpoint_name, sagemaker_session, json_serializer, json_deserializer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since hosting functions implemented outside of train script we can't just use estimator object to deploy the model. Instead we need to create a PyTorchModel object using the latest training job to get the S3 location of the trained model data. Besides model data location in S3, we also need to configure PyTorchModel with the script and source directory (because our `generate` script requires model and data classes from source directory), an IAM role."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.pytorch import PyTorchModel\n",
    "\n",
    "training_job_name = estimator.latest_training_job.name\n",
    "desc = sagemaker_session.sagemaker_client.describe_training_job(TrainingJobName=training_job_name)\n",
    "trained_model_location = desc['ModelArtifacts']['S3ModelArtifacts']\n",
    "model = PyTorchModel(model_data=trained_model_location,\n",
    "                     role=role,\n",
    "                     framework_version='1.0.0',\n",
    "                     entry_point='generate.py',\n",
    "                     source_dir='source',\n",
    "                     predictor_cls=JSONPredictor)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create endpoint\n",
    "\n",
    "Now the model is ready to be deployed at a SageMaker endpoint and we are going to use the `sagemaker.pytorch.model.PyTorchModel.deploy` method to do this. We can use a CPU-based instance for inference (in this case an ml.m4.xlarge), even though we trained on GPU instances, because at the end of training we moved model to cpu before returning it. This way we can load trained model on any device and then move to GPU if CUDA is available. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate\n",
    "We are going to use our deployed model to generate text by providing random seed, temperature (higher will increase diversity) and number of words we would like to get."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "input = {\n",
    "    'seed': 111,\n",
    "    'temperature': 2.0,\n",
    "    'words': 100\n",
    "}\n",
    "response = predictor.predict(input)\n",
    "print(response)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cleanup\n",
    "\n",
    "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sagemaker_session.delete_endpoint(predictor.endpoint)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_pytorch_p27)",
   "language": "python",
   "name": "conda_pytorch_p27"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
